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Abstract 

In this paper, we propose a method for activity recognition from videos based on sparse 
local features and hypergraph matching. We benefit from special properties of the temporal 
domain in the data to derive a sequential and fast graph matching algorithm for GPUs. 

Traditionally, graphs and hypergraphs are frequently used to recognize complex and often 
non-rigid patterns in computer vision, either through graph matching or point-set matching 
with graphs. Most formulations resort to the minimization of a difficult discrete energy function 
mixing geometric or structural terms with data attached terms involving appearance features. 
Traditional methods solve this minimization problem approximately, for instance with spectral 
techniques. 

In this work, instead of solving the problem approximatively, the exact solution for the 
optimal assignment is calculated in parallel on GPUs. The graphical structure is simplified 
and regularized, which allows to derive an efficient recursive minimization algorithm. The 
algorithm distributes subproblems over the calculation units of a GPU, which solves them in 
parallel, allowing the system to run faster than real-time on medium-end GPUs. 

Keywords: Activity recognition Graph matching Parallel algorithms GPU Video analysis 

1 Introduction 

Many computer vision problems can be formulated as graphs and associated algorithms, since 
graphs provide a structured and flexible way to inject spatial and structural relationships into match¬ 
ing algorithms. In this paper, it is employed for recognition and localization of actions in videos. 
The task to detect and localize activities in time and in space and to classify them requires dealing 
with large quantities of video data in real time. 

We formulate the activity recognition task as a correspondence problem between sparse features 
of short duration model actions and those of longer duration scene videos. The articulated nature of 
human motions makes it impossible to employ rigid transformations and methods (like RANSAC 
Ifl4l l. while graph matching is able to cope with such non-rigid deformations. The proposed method 
structures space-time interest points into graphs or hypergraphs using proximity information, as is 
frequently done in the context of visual recognition (object or activity recognition). The optimal 
matching between a model video and a scene video is cast as a combinatorial problem to minimize 
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a function containing terms measuring differences in appearance features as well as terms adressing 
space-time geometric distortions in the matching. 

Despite advances in graph matching due to its popularity and effectiveness, it still remains 
a challenging task. The computational complexity renders the use of the exact problem on data 
having a large number of nodes, e.g. video data, intractable in practice. Formulations useful in 
vision are known to be NP-hard: while the graph isomorphism problem is conjectured to be solvable 
in polynomial time, it is known that exact subgraph matching is NP-hard fid! , and so is subgraph 
isomorphism fl6l . Graph matching solutions in practical problems therefore are of approximate 
nature. 

Recently, computer vision has immensely benefited from development of general purpose com¬ 
putation on graphical processing units (GPUs). Prominent examples are classification in various 
applications (e.g. pose estimation lf37l ) and convolution operations, for instance in deep learning 
architectures ll22ll . feature tracking and matching l38l and patch based image operations, as for in¬ 
stance inpainting l l43ll . While it is straightforward to profit from parallel architectures if the problem 
is inherently parallelizable and data-oriented, structured problems are often characterized by com¬ 
plex dependencies which make parallelization of the algorithms difficult. The main inpediments 
are irregular memory access, tree-based search structures, variable computation and memory units 
requirement tfl8ll . 

Whereas most existing work solves the matching problem approximatively, in this work the 
exact global minimum is calculated, which is made possible by two properties: 

• We benefit from two sources, first from the application itself (activity recognition in videos) 
and the fact that the data are embedded in space time, in particular from specific properties 
of the time dimension; 

• We approximate the graphical structure and therefore solve a simplified problem exactly and 
efficiently. The work thus falls into the category of methods calculating the exact solution 
for an approximated model, unlike methods calculating an approximate solution of an exact 
problem. In this sense, it can be compared to Q, where the original graph of a 2D object 
is replaced by a k-tree allowing exact minimization by the junction tree algorithm. Our 
solution is different in that the graphical structure is not created randomly but is derived from 
the temporal dimensions of video data. 

The linearly ordered nature of actions in time allows matching to proceed through a recursive 
algorithm over time, which is sequential in nature. Flowever, the subproblems corresponding to 
calculations associated for individual time instants involve iterations over the domains of discrete 
variables, which are independent. Faster than real-time performance is achieved through a parallel 
solution of these subproblems on GPUs. 

1.1 Related work 

Graphs and graph matching in computer vision — In computer vision, graph matching has been 
mainly applied for object recognition, but some applications to action recognition problems have 
recently been reported. In this context, graphs are frequently constructed from sparse primitives 
like space time interest points ||39lfT7l [5l or from regions and adjacency information gathered from 
over-segmented videos (6). Common matching strategies in this context are off-the-shelf spectral 
techniques ll39ll or dynamic time warping on strings of graphs fl7l . 

Other graph related work, albeit not necessarily by matching, is based on stochastic Kronecker 
graphs Boll for modeling the features pertaining to a specific activity class, chain-graphs |[46l . 
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Figure 1: Videos are represented through space time interest points: (a) examples of successful 
matches of space time points from the model video (left) to the scene video (right); (b) model 
videos are structured into graphs with a specific structure using proximity information. 


weighted directed graphs ||28l where nodes are assigned to frames with salient postures or silhou¬ 
ettes that are shared among different action categories. Here, edges encode the transition proba¬ 
bilities between these postures. Salient postures arc used for single-view and multi-view action 
recognition in ll30ll . In lf30l . different actions are linked by chain graphs, called as Action Nets. In 
|[9l , the test video is first divided into several subvolumes and the maximum subgraph is searched 
through branch-and-cut algorithms. 

Graph matching and graphs are not the only technique successfully employed for activity recog¬ 
nition. A review of the state of the art of this problem is beyond the scope of this paper, we refer 
the reader to |[Q for a recent survey. 

Solving the graph matching problem — Two different formulations dominate the literature 
on graph matching: (i) Exact matching: a strictly structure-preserving correspondence between 
the two graphs or at least between their respective parts is searched; (ii) Inexact matching, where 
compromises in the correspondence are allowed in principle by admitting structural deformations 
up to some extent. Matching proceeds by minimizing an objective (energy) function. 

Most recent papers on graph matching in the computer vision context are based on inexact 
matching of valued graphs, i.e., graphs with additional geometric and/or appearance information 
associated with nodes and/or edges. Practical formulations of this problem are known to be NP- 
hard ll4Tll . which makes approximations unavoidable. Two different strategies are frequently used: 
(i) Calculation of an approximate solution of the minimization problem; (ii) Calculation of the exact 
solution, most frequently of an approximated model. 

Approximate solution — A well known family of methods solve a continuous relaxation of 
the original combinatorial problem. Zass and Shashua P3Tl presented a soft hypergraph matching 
method between sets of features that proceeds through an iterative successive projection algorithm 
in a probabilistic setting. They extended the Sinkhorn algorithm li20l . which is used for soft assign¬ 
ment in combinatorial problems, to obtain a global optimum in the special case when the two graphs 
have the same number of vertices and an exact matching is desired. They also presented a sam¬ 
pling scheme to handle the combinatorial explosion due to the degree of hypergraphs. Zaslavskiy 
et al. 1441 employed a convex-concave programming approach to solve the least-squares problem 
over the permutation matrices. More explicitly, they proposed two relaxations to the quadratic as- 
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signment problem over the set of permutation matrices which results in one quadratic convex and 
one quadratic concave optimization problem. They obtained an approximate solution of the match¬ 
ing problem through a path following algorithm that tracks a path of local minimum by linearly 
interpolating convex and concave formulations. 

A specific form of relaxation is done by spectral methods, which study the similarities between 
the eigen-structures of the adjacency or Laplacian matrices of the graphs or of the assignment matri¬ 
ces corresponding to the minimization problem formulated in matrix form. In particular, Duchenne 
et al. lil2Tl generalized the spectral matching method from the pairwise graphs presented in ll26l to 
hypergraphs by using a tensor-based algorithm to represent affinity between feature tuples, which 
is then solved as an eigen-problem on the assignment matrix. More explicitly, they solved the re¬ 
laxed problem by using a multi-dimensional power iteration method, and obtained a sparse output 
by taking into account Zi-norm constraints instead of the classical -norm. Leordeanu et al. E71 
made an improvement on the solution to the integer quadratic programming problem in lfl2ll by 
introducing a semi-supervised learning approach. In the same vein, Lee et al. G5l approached this 
problem via the random walk concept. 

Another approach is to decompose the original discrete matching problem into subproblems, 
which are then solved with different optimization tools. A case in point, Torresani et al. iHTfl 
solved the subproblems through graph-cuts, Hungarian algorithm and local search. Lin et al. ll29l 
first determined a number of subproblems where each one is characterized by local assignment can¬ 
didates, i.e., by plausible matches between model and scene local structures. For example, in action 
recognition domain, these local structures can correspond to human body parts. Then, they built a 
candidacy graph representation by taking into account these candidates on a layered (hierarchical) 
structure and formulated the matching problem as a multiple coloring problem. Finally, Duchenne 
et al. IT3l extended one dimensional multi-label graph cuts minimization algorithm to images for 
optimizing the Markov Random Fields (MRFs). 

Approximate graphical structure — An alternative approach is to approximate the data model, for 
instance the graphical structure, as opposed to applying an approximate matching algorithm to the 
complete data model. One way is to simplify the graph by filtering out the unfruitful portion of 
the data before matching. For example, a method for object recognition has been proposed by 
Caetano et al. 0, which approximates the model graph by building a k-tree randomly from the 
spatial interest points of the object. Then, matching was calculated using the classical junction tree 
algorithm Il24l known for solving the inference problem in Bayesian Networks. 

A special case is the work by Bergthold et al. 0, who perform object recognition using fully 
connected graphs of small size (between 5 and 15 nodes). The graphs can be small because the 
nodes correspond to semantically meaningful parts in an object, for instance landmarks in a face 
or body parts in human detection. A spanning tree is calculated on the graph, and from this tree 
a graph is constructed describing the complete state space. The A* algorithm then searches the 
shortest path in this graph using a search heuristic. The method is approximative in principle, as 
hypotheses are discarded due to memory requirements. However, for some of the smaller graphs 
used in certain applications, the exact solution can be obtained. 

Parallel graph matching — Parallel algorithms have been proposed for the graph matching 
problem for some time. Although both exact solutions and approximate solutions, can be parallized 
in principle, most of the existing parallel algorithms have been proposed for approximate solution. 
Many spectral methods, which are approximative, can be naturally ported to a GPU architecture, 
as the underlying numerical algorithms require matrix operations. Similar matrix operations are 
employed in ll33l . where multiple graphs are matched using graduated assignment on GPUs. 

Matching two graphs can also be performed by searching the maximum common subgraph 
of the two graphs ifTTl . a problem which can be transformed (in linear time) to the problem of 
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searching the maximum clique of a graph. For this problem parallel and GPU algorithms do exist, 
albeit not very efficient ones lfl8l . 

Parallel and GPU algorithms for bi-partite matching have been proposed recently EBliTT l. Bi¬ 
partite matching is, however, different and less difficult, as polynomial time algorithms are known 
for them. These algorithms alternate bidding (proposed assignments) kernels and assignment ker¬ 
nels on the GPU as well as convergence tests on the CPU. A similar problem, unfortunately also 
called graph matching, matches neighboring vertices of a single graph under unicity constraints. 
In other terms, a matching or independent edge set in a graph is a set of edges without common 
vertices. GPU algorithms have been proposed for this kind of problem 621. 

2 Problem Formulation 

Detecting, recognizing and localizing activities in a video stream is cast as a matching problem 
between a scene video, which can be potentially long or short if the data is processed block-wise in 
a stream, and a dictionary of relatively much shorter model videos describing the set of activities to 
recognize. Matching is done by pairs, solving a correspondence problem between the scene video 
and a single model video at a time. We formulate the problem as a particular case of the general 
correspondence problem between two point sets with the objective of assigning points from the 
model set to points in the scene set, such that some geometrical invariance is satisfied. In particular, 
videos are represented as space-time interest points — see Figure [ljt for examples of successful 
matchings between model point sets and scene point sets. Each point is described by its location 
in space-time and appearance features, i.e., a descriptor locally representing the space-time region 
around the point. 

The M points of the model are organized as a hypergraph Q={V,£}, where V is the set of 
nodes (corresponding to the points) and £ is the set of edges. Let us recall that hypergraphs are 
a generalization of graphs, where edges, often called hyperedges, can link any number of nodes, 
generally >2. The set of scene nodes is not structured. 

Each node i of the model graph is assigned a discrete variable Zi, i = 1..M, which represents 
the mapping from the i lh model node to some scene node, and can take values in 1 ..S', where S 
is the number of scene nodes. We use the shorthand notation z to denote the whole set of map 
variables z*. A solution of the matching problem is given through the values of the z t , where z,=j, 
i=l..M, is interpreted as model node i being assigned to scene node j = 1 ..S'. Each combination 
of assignments z evaluates to an energy value in terms of the following energy function E(z ): 



( 1 ) 


(■ i,j,k)e£ 


Here, U is a data attached term taking into account the distance between appearance features of 
point i and its assigned point z. t , D is the geometric distortion between the space-time triangle 
associated with hyperedge ( i,j, k) and the triangle associated with (z,;, Zj, z/j, and Ai and A 2 are 
weights. For convenience, dependencies on all values over which we do not optimize have been 
omitted from the notation. 

U is defined as the Euclidean distance between the appearance features of assigned points in the 
case of a candidate match, and it takes a penalty value W d for dummy assignments which handle 
situations where a model point is not found in the scene: 



( 2 ) 
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fi and f being respectively the feature vector of model point i, and the feature vector of scene 
point z t . 

The D term is based on angles. Since our data is embedded in space-time, angles include 
a temporal component not related to scale changes induced by zooming. We therefore split the 
geometry term I) into a temporal distortion term D t and a spatial geometric distortion term D 9 , 
weighted by a parameter A 3 : 

Zj 1 Zfc) — D (zi,Zj,z k ) + A 3 D 9 (zi,Zj,z k ) (3) 

where the temporal distortion I) 1 is defined as time differences over two pairs of nodes of the 
triangle: 

D t (z i , zj, z k ) = A (i,j) + A (j, k ) (4) 

with: 

A (h 3) = |(*(«) - t(j)) ~ - t'(zj))\ (5) 

Here, A (i,j) is the time distortion due to the assignment of model node pair (i. j) to scene node 
pair (zi, Zj). The temporal distortion term penalizes the discrepancy in the extent of time between 
model node pairs and the corresponding scene node pairs. The model node pairs should not be 
too close or too far from each other likewise the scene node pairs. Finally, D 9 is defined over 
differences of angles: 


D 9 (zi,zj,z k ) 


k) - a!{zi,zj,z k ) 
a(j,i,k ) - a'(zj,zi,z k ) 


( 6 ) 


Here, a(i,j,k) and a'(zi, Zj, z k ) denote the angles subtended at point j for, respectively, model 
triangle indexed by (■ i,j,k ) and scene triangle indexed by ( Zi,Zj,z k ). |j.|| is the L2 norm. The 
difference between angles takes into account the circular domain of angles. 


2.1 Approximations 

In the context of activity recognition, the geometric data are embedded in space-time. We make 
the following assumptions relative to the temporal domain to derive an efficient minimization algo¬ 
rithm: 

Assumption 1: Causality — While objects (and humans) can undergo arbitrary geometrical 
transformations like translation and rotation, which is subsumed by geometrical matching invari¬ 
ance in our formulation, human actions can normally not be reversed. In a correct match, the 
temporal order of the points should be retained, which can be formalized as follows 

Vi, j : t(i) < t(j) t'(zi) < t'(zj) (7) 

where the notation t(i) stands for the temporal coordinate of model interest point i, and t' ( z t ) stands 
for the temporal coordinate of scene interest point z%. 

Assumption 2: Temporal closeness — Another reasonable assumption is that the extent of 
time warping between model and scene time axes must be limited. In other words, two points which 
are close in time must be close in both the model set and the scene set. Since our graph is created 
from proximity information (time distances have been thresholded to construct the hyperedges), 
this can be formalized as follows: 

V i,j,k G £ : | t'(zi) - t'(zj)\ < T A \t'(zj) - t'(z k )\ < T (8) 

where T is a parameter. 
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We have shown in JS), that this problem can be solved in polynomial time if the data are 
embedded in space-time, as opposed to the general class of problems, which is NP-hard BT1 . 

Due to the sequential nature of activities, graphs obtained from data embedded in space-time 
are generally elongated in time, i.e. the graphical structure is heavily influenced by the temporal 
order of the nodes. This is particularly true in the case of graphs constructed from space time 
interest points, which are commonly very sparse. Typical interest point detectors extract only few 
such points per frame — typically between 0 and 5 lf23l . We take advantage of this observation to 
restrict the model graph by keeping only a single interest point per model frame. This is done by 
choosing the most salient one, i.e., the one with the highest confidence of the interest point detector. 

We also restrict the set £ of model graph edges to connections of each model point i to its two 
immediate frame-wise predecessors i — 1 and i — 2 as well as to its two immediate successors i + 1 
and i + 2. This creates a planar graph with triangular structure, as illustrated in Figure [TJ>. A video 
is described by the way this planar graph twists in space time, as well as the appearance features 
associated with each node. According to the visual content of a video, there may be frames which 
do not contain any space time interest points, and therefore no nodes in the model graph. These 
empty frames are not taken into account, e.g. in Equation ([8]), when triplets of consecutive frame 
numbers are considered. 

2.2 Minimization 

The neighborhood system of this simplified graph can be described in a very simple way using the 
node indices of the graph, similar to the dependency graph of a second order Markov chain. The 
general energy given in 0 can therefore be expressed simpler as follows: 

M M 

EM = E U(zi) T Zi—i, Zi—2) (9) 

i= 1 i =3 

where we also have absorbed Ai and A 2 into U and D, respectively. The elongated form of the 
graph allows us to derive an efficient inference algorithm for calculating z = argmin z E(z) based 
on the following recursion: 

ati(zi- 1 ,^— 2 ) = min 

Zi 

with the initialization 

a M (z M -i,z M - 2 ) = min [U(z M ) + D(z M , z M - 1 , z M - 2 )] (11) 

Z M 

During the calculation of the trellis, the arguments of the minima in Equation ( |TT)| ) are stored in 
a table /3i(zi-±, Zi- 2 ). Once the trellis is completed, the optimal assignment can be calculated 
through classical backtracking: 

Zi = Pi(Zi-l,Zi-2), (12) 

starting from an initial search for z\ and z 2 : 

(zi,z 2 ) = arg min [U(zi) + U(z 2 ) + a 3 (z 2 , zx)\ (13) 

Zl,Z 2 

The computational complexity and the memory complexity of the algorithm are defined by the 
trellis, a MxSxS matrix, where each cell corresponds to a possible value of a given variable. The 


U(zi) + D(zi,Zi-i,Zi- 2 ) + a i+ i(zi, Zi-i) 


( 10 ) 
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calculation of each cell requires to iterate over all S possible combinations of z t . It is easy to see 
that the computational complexity is 0(S 3 M ) and that the memory complexity is 0(S 2 M). 

Exploiting the different assumptions on the spatio-temporal data given above, the computational 
complexity can be decreased further. A large amount of combinations from the trellis can be pruned 
applying the following constraints: 

• given variable z t , all values for its predecessors | and z ,_2 must be necessarily before Zi, 
i.e. lower. 

• given variable zi, we will allow a maximum number of T possibilities for the values of the 
successors Zj+i, Zi+ 2 , which are required to be close. 

These pruning measures decrease the complexity to 0(SMT 2 ), where T is a small constant mea¬ 
sured in the number of frames. Since T is a small constant, the computational complexity therefore 
becomes linear on the number of points in the scene: O(SM). 

Let us note that no such prior pruning is applied to the scene frames, which therefore may 
contain an arbitrary number of points. At a first glimpse it could be suspected that the single-point- 
per-frame approach could be too limited to adequately capture the essence of an action sequence. 
Experiments have shown, however, that the single chain performs surprisingly well. It should be 
noted again, that no restrictions have been imposed on the scene, in other words, none of the scene 
points have been eliminated. 

3 A parallel solver for GPUs 

Solving the matching problem requires computing Equations ( [IT] ), ( fTT)| ), ( |T~2| ) and ( [T3| ). However, 
the computational and memory complexity are dominated by the requirement to solve ( fT0| ) for dif¬ 
ferent indices i and for different combinations of z,_ 1 and 2 , which boils down to filling a three 
dimensional array with values, taking into account certain dependencies between these values. In 
the following section we will present a parallel solver for this problem designed for modern GPUs. 
Although the system has been implemented using the hardware independent OpenCL library, we 
kept the description as library independent as possible. For this reason, and to make the paper self- 
contained, we first define some common terms (and refer to Figure [2]for a block scheme of modem 
GPU architecture): 

Kernel — a kernel is a piece of code which is executed by the GPU in parallel on several indepen¬ 
dent hardware units; 

Work-item — the execution of a kernel on a single data package is called a work-item; 

Work-group — work-items can be grouped into work-groups; all work-items of one work-group 
share data in local memory; 

Global memory — global GPU memory can be accessed by all work-items. It is slower than local 
GPU memory; 

Local memory — local GPU memory is shared by the work-items of the same work-group. It is 
faster than global GPU memory; 


RAM — the computer’s (classical) main memory used by the CPU. It cannot be directly accessed 
by work-items, thus code running on the GPU. 



Figure 2: The architecture of a CPU+GPU system as seen by the hardware independent OpenCL 
library. 


Given the restrictions of memory access and execution on the GPU, parallel algorithms follow a 
classical three-step process: 

• transfer data from RAM to global GPU memory; 

• eventually transfer data from global GPU memory to different local GPU memory banks; 

• execute the kernel multiple times in parallel on the GPU; 

• transfer results from GPU memory to RAM 

Dependencies between results may require more complex schemes and/or multiple iterations of this 
process. 
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Figure 3: A visualization of the 3D trellis of Equation (10 1 as a series of 2D tables of size 5x5. 
Each cell in table a* is a result of a minimization operation taking as input results from a row in 
table «i+i. 


3.1 Defining the GPU kernel 

The recursion on i in Equation ( [TO] ) produces (and works on) a 3D structure with 3 coordinates : 
i, Zi- 1 , Zi- 2 . Flere, i is the model frame/node index, which can also be interpreted as a temporal 
coordinate, and which takes values in {1..M}; Zi -1 and Zi -2 are assignment variables which can 
each take values in {1..5}. It is convenient to visualize this trellis as a series of 2D tables, as 
illustrated in Figure [3] The kernel code will execute the minimization in Equation (jTO]). where a 
single work-item deals with a triplet of parameters given by i, z t -\ and Zi- 2 . Accordingly, a total 
number of M 5 2 kernel executions (work-items) is required to fill the trellis. 

The dependencies in Equation (T0| ) make it impossible to execute all kernels in parallel: it 
is easy to see, that a full row in a * + 1 is required as input for each cell in cti. A scheduling is 
needed, which executes kernels in parallel with a block-wise sequential ordering, insuring that a 
whole ctj+i row is available, before the corresponding cell in ct* is executed. This scheduling could 
theoretically be done by the kernel itself through synchronization points. However, kernel-wise 
synchronization is only possible for work-items of the same work-group. The amount of work- 
items in the studied problem (MS 2 ) makes it unreasonable to use a single work-group for the 
whole problem. Indeed, convenient values are M=30 and S=60. This corresponds to 1 seconds of 
model video and 2 seconds of scene video at 30 frames per second, using one node per frame. The 
amount of work items involved in the matching of the model graph to the scene graph, MS 2 , is 
then 108000. This value is far above the capacity of even a high end GPU like the GTX580, where 
a workgoup is limited to 1024 work items. If we match a model against larger blocks of the scene, 
for instance entire videos, then this value will be even higher (see also table [2] for comparisons of 
two different scenarios with 5=754 and 5=60). 

A different solution is to perform synchronization through the CPU, i.e. launch parallel execu¬ 
tions of a single array a :., + 1 on the GPU with the CPU taking over control between iterations over 
i. In other words, at the end of the kernel execution, the fall back to CPU execution acts as a syn- 
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Figure 4: The 3D trellis of size MxS 2 calculated by the recursive minimization algorithm. The 
shaded cross-section of size ~ MxSxT in the middle corresponds to the admissible combinations 

of i, Zi -1 and Zi- 2 . 


eh ionisation point for all work-items. Note, that the resulting scheduling is not different; a slight 
performance loss is due to the control flow change between the GPU and the CPU. This leads to a 
two-dimensional kernel, the dimensions being zi-\ and Zi- 2 . A single loop in the kernel iterates 
over z, to perform the minimization operation in (fT0|). 


3.2 Pruning the trellis 


The causality assumption ([7]) and the temporal closeness assumption (j 8 ]) restrict the admissible 
combinations of Zi, z,;_ ] and Z{- 2 , and therefore allow us to prune combinations in the trellis. In 
particular', for a given value of Zj_i, the value of z h - 2 is restricted to the interval ]z,;_ i — T, Zi- 1 [. 
This limits the admissible values of the trellis to a cross section of size ~ SxT in the 3D volume, 
as illustrated in Figure [4] The validity tests for these constraints are precomputed and stored in a 


boolean array. A similar reasoning restricts the values of z^ c.f section 3.4 


3.3 Precomputing the unary terms 


The complexity of the unoptimized kernel is dominated by the calculation of the unary terms U ( Zi ), 
which are calculated in each iteration of the minimization loop, and which correspond to the Eu¬ 
clidean distance between the feature vector of a model node and the feature vector of a scene node. 
The size F of the appearance feature vector may vary, common sizes are 50 to 200 components 
(162 in the case of our HoG/HoF features). 

Assumptions (|7]l and ([8 1 decrease the part of U for a work-item from O(FS), to 0{FT), where 
T<S (see also sub section 3.4). The unoptimized computational complexity for a whole array a t 
is 0(FST 2 ) and for the whole trellis it is 0(FSMT 2 ). Another way to derive the same result is to 
consider that each minimization in equation (jTO]) takes place over T iterations, that the minimization 
is performed SMT times (c.f. the blue cross section in figure[3]), and that one term is of O(F). 

The unary terms take as input the model node and scene node. Therefore, out of the SMT 2 
calls to U, only SM different input arguments exist, which can be pre-calculated on the GPU by 
a separate kernel. The pre-computed unary terms are stored in a look-up table of size 5 x Min 
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Figure 5: Dependencies: one column of array a, is handled by the work-items of the same work¬ 
group. The values of this column depend on a single row of previous array a,;+i. 

the GPU global memory. They are later used by the main GPU kernel, which calculates one ctj 
array. When the a* kernel evaluates the matching of model node i to scene node z r , it reads the 
corresponding pre-computed value of U in the look-up table. 

Precomputing U for all combinations of model nodes and scene nodes saves T 2 calls to U. 
The pre-computations themselves can be done in O(FSM), so total computational complexity is 
reduced from 0(FSMT 2 ) to 0(SMT 2 + FSM ). For typical values of T=10 and F=162, the 
speed up factor is 61. 

3.4 Precomputing the minimization loop boundaries 

The minimization loop involved in equation ( p~Q] > is executed inside each kernel. During one kernel 
execution, i.e. for one work-item, and z ) _2 have a fixed value. The loop is then executed 
over Zi, whose value arc limited by z t -\ and Zi- 2 , according to causal constraint ([7]) and tempo¬ 
ral closeness constraint ([ 8 ]). Applying the two constraints above and performing some algebraic 
simplifications results in two equations involving zf. 

t\Zi)>t'( Zi -l) 

t'(zi) < t'(zi — 2) + T 

Equations < [T4| ) define an interval in temporal coordinates, which needs to be translated into an 
interval in scene nodes. Let us recall that the minimization in equation ( fTUj ) is over possible scene 
nodes. As opposed to model graphs, scenes can feature multiple nodes per frame, as illustrated in 
figure ([ 6 ]). This distribution of nodes over frames can be pre-calculated and stored in a table: 

minnode(f) = inf({n : t'(n) = f}),n = 1...S 

where minnode(f) gives the first node of a given scene frame /, assuming that the scene nodes 
are sorted in temporal (i.e. frame) order. Then, the boundaries of the minimization loop in equation 
<0 can be directly derived as follows: 

min (zi )=minnode (t'(zi-i)) 

m&x(zi)=minnode(t' (zi- 2 ) + T) — 1 

where we took advantage of the fact that the maximum node of a frame / is equal to minnode(f + 
1 ) " 1- 

Function minnode(.) only depends on the distribution of the nodes in the scene. It can be 
pre-computed for each scene block, stored in GPU memory, and then shared by all work-items. 
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Figure 6: Example of a distribution of scene nodes over scene frames: as opposed to models, scenes 
may feature multiple nodes per frame. For example, frame 8 has nodes 9, 10 and 11. 


Algorithm 1: The kernel executed for each work-item. 


Kernel 

Copy a cell of ctj+i from global mem. to local mem.; 
Wait for synchronization between all work-items; 
Compute one cell of array «:. t (minimization in (101); 

end 


3.5 Work-groups and local GPU memory 

Local GPU memory (NVidia “shared memory” or OpenCL “local memory”) offers faster access 
than global GPU memory. Local memory is associated to a work-group, i.e., only work-items of 
this work-group can access it. Read and write operations to local memory can be synchronized 
between work-items of a single work-group. The goal is to optimize performance by performing a 
single transfer from global memory to local memory, followed by several accesses to local memory 
directly done by the kernels executions. Efficiency is guaranteed by organizing work-groups such 
that work-items can share common data. 

As illustrated in Figure [5] all cells of a single column of an array a, depend on the same row 
of array oti +This can easily be derived from the recursive Equation ( [T()| ): calculating one column 
Zi- 2 ) for fixed % and z r -\ requires the values a t+ 1 {z % , z r -\) for fixed i + 1 and zi-\ over 
varying Zi, which corresponds to a row in the precedently calculated array. This naturally leads 
to a configuration where one column of the currently calculated array a t is organized in a single 
work-group, where the column corresponds to the expression ati(zi- 1 , Zi- 2 ) for fixed i and z,;_ 1 . 

The memory transfer of a whole row for a single workgroup is distributed over its work-items. 
As all tables cti are of square shape (the number of columns equals the number of rows), a single 
value is transferred by each work-item at the beginning of the kernel execution. After a synchroni¬ 
sation point, the minimization routine itself is executed. This results in the kernel structure given 
in Algorithm [T] 

As a result, S 2 read operations from slow global memory per work-group have been replaced by 
S read operations from global memory plus S 2 read operations from fast local memory. The total 
number of operations slightly grows by a small factor of (S + S 2 ) / S 2 , which in practice is smaller 
than 1.017 for the configuration we employed (S'=60). However, the total access time decreases 
due to faster memory accesses. 
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Algorithm 2: The CPU side algorithm. 

Copy full data from RAM to global GPU memory; 
for i=M to 1 do 

In parallel do 

Compute array a t on GPU (run kernel); 

if i < M then 

Copy array «:, +1 from GPU to RAM; 

end 

end 

end 

Copy array a\ from GPU to RAM; 


3.6 GPU memory persistence 

All model and scene data must be transfered from the CPU controlled RAM to the GPU, and 
hardware constraints force a transfer to global GPU memory. Let us recall that after a parallel 
computation of an a nay o.,, execution falls back to the CPU, which controls the iterations over the 
different arrays a* for different indices i. This could potentially involve a large number of memory 
transfers between CPU RAM and global GPU memory, as data needs to be exchanged between the 
CPU and the GPU before and after each iteration. 

Data is persistent in global GPU memory between kernel executions, which can be exploited to 
minimize the number of memory transfers. The full model data and the currently processed scene 
block are initially entirely transferred to the global GPU memory, and stay persistent between 
iterations over the tables a t . The a* arrays also stay in GPU memory between successive kernel 
executions. 

3.7 Parallelizing calculations and transfer 

After calculation of the trellis, the optimal point assignment is calculated with the backtracking step 
given in Equation < [T2| ). This step is of very low complexity (a single loop over M iterations) and 
inherently sequential, therefore it is performed on the CPU. The whole trellis (tables ai(zi-i, Zi- 2 ) 
for i E [1, M]) must be transferred from global GPU memory to the CPU accessible RAM. Modern 
GPU architectures provide the ability to parallelize kernel executions and data transfer, which is 
here exploited to copy the results available in array a - l+ 1 while kernel executions are computing the 
entries of ctj. The CPU side of the matching algorithm is given in Algorithm [2j 

3.8 Memory structure optimizations 

An important concept in efficient computing on most GPU architectures is to ensure coalesced 
accesses to global GPU memory. Practically, this means that data in the global GPU memory has to 
be organised in contiguous blocks of multiples of 128 bytes to maximize access rate inside work- 
items. As a whole row of table ccj+i needs to be read for each cell (work-item) in on, we store the 
different tables a* in row order. 

3.9 Computational complexity and run-time 

As mentioned in section |2.2[ taking advantage of all approximations results in a computational 
complexity of 0(SMT' 2 ), where T is a small constant (T=10 in our experiments). However, that 
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Work 

Average perf. 

Run-time 

Evaluation 

Remarks on run-time 

Ta et al. Il39l 

91.2% 

1.86s 

LOOCV 

s/frame, matching with 98 
model graphs 

Borzeshi et al. f5| 

70.2% 

N/A 

Split 8/8/9 


Brendel & Todorovic [6] 

N/A 

10s 

N/A 

Matching 1000 nodes graph 
with 2000+ nodes graph 

Lv & Nevada |30l 

N/A 

5.1s 

N/A 

s/frame 

Savarese et al. 1351 

86.8% 

N/A 

LOOCV 


Ryoo & Aggarwal lf34l 

93.8% 

N/A 

LOOCV 


Mikolajczyk & Uemura 11311 

95.3% 

5.5s to 8s 

LOOCV 

s/frame (-5s if SVM are not 
used) 

Baccouche et al. J3| 

95.8 

N/A 

LOOCV 

N/A 

Jiang et al. fl9l 

93.4% 

N/A 

LOOCV 


Our method on CPU 

91.0% 

0.2s 

Split 8/8/9 

s/frame, matching 754- 
nodes scene with 50 model 
graphs 

Our method on GPU 

91.0% 

0.02s 

Split 8/8/9 

s/frame, matching 754- 
nodes scene with 50 model 
graphs 

Our method on GPU 

91.0% 

0.0035s 

Split 8/8/9 

s/frame, matching 60-nodes 
scene with 50 model graphs 


Table 1: Comparison with the state-of-the-art methods on the KTH database (LOOCV = Leave- 
one-out-cross validation). 


does not necessarily mean that run-time is a function of 0{SMT 2 ), as parallel processing is not 
yet taken into account. Actual run-time is of course considerably lower. If we consider Q work- 
units (cores), run-time is of 0{\M kernel calls are launched, each one performing a 
minimization over T iterations, and each kernel call scheduling ST work-items. In the case where 
enough work-units are available on the GPU to perform all ST work-items of one call in parallel, 
i.e. Q > ST, run-time is actually of O(MT). 

4 Experimental Results 

The presented matching algorithm has been evaluated on a real-world application, namely action 
recognition in video sequences. The widely used public KTH database ll36Tl has been chosen as 
test dataset. It contains 25 subjects performing 6 actions {walking, jogging, running, handwaving, 
handclapping and boxing ) recorded in four different scenarios including indoor/outdoor scenes and 
different camera viewpoints, totally 599 video sequences (one is corrupted). The subdivision of 
the sequences is the same as the one provided on the official dataset websiteQ This results in 2391 
subsequences in total. 

The images of the KTH database are of size 160x120. However, only the preprocessing steps 
depend on the image size, the complexity of the matching algorithm itself is independent of it. 

We use spatio-temporal interest points extracted with the 3D Harris detector ll23ll to constitute 
the nodes of the graph. Appearance features are the well-known HoG-HoF features extracted with 
the publicly available code in li23ll . The weighting parameters are set so that each distortion measure 

'http://www.nada.kth.se/cvap/actions/ 
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Implementation 

#Scene 

nodes 

#Scene 

frames 

Time/single model 
Tot(ms) per fr(ms) 

All 50 models 
Time/fr (ms) 

CPU: Intel Core 2 Duo, 

E8600 @ 3.33 Ghz, 
Matlab/C(mex) 

754 

723 

2900 

4.01 

200.5 


Nvidia GeForce GTS450, 

754 

723 

748 

1.03 

51.5 


192 cuda cores @ 1566 MHz, 
mem bandwidth 21.3 GB/sec 

60 

55 

4 

0.07 

3.5 

real time! 

Nvidia GeForce GTX560, 

754 

723 

405 

0.56 

28 

real time! 

336 cuda cores @ 1660 MHz, 
mem bandwidth 128 GB/sec 

60 

55 

4 

0.07 

3.5 

real time! 

Nvidia GeForce GTX580, 

754 

723 

178 

0.25 

21 

real time! 

512 cuda cores @ 1544 MHz, 
mem bandwidth 192 GB/sec 

60 

55 

4 

0.07 

3.5 

real time! 


Table 2: Running times in milliseconds for two different GPUs and for two different scene block 
sizes. The last column on the right gives times per frame for matching the whole set of 50 model 
graphs. The bold blue value of 178 ms is comparable to the values in table [3] 


has the same range of values: Ai=0.6, A 2 = 0 . 2 , A 3 = 5 , T=10 (A 3 is explained in the appendix). 

All experiments use the leave-one-subject-out (LOSO) strategy. We augmented the number of 
model graph prototypes by selecting different sub-sequences and constructed each graph consisting 
of 20 to 30 frames containing at least one or more salient interest points. Action classes on the 
unseen subjects are recognized with a nearest prototype classifier (NPC). The distance between 
model and prototypes is based on the matching energy given in Equation (jT|). However, experiments 
showed that best performance is obtained if only the appearance terms U(.) are used for distance 
calculation instead of the full energy ([!]). 

We learned a set of discriminative prototypes using sequential floating forward search (SFFS) 
lf32l : model graph prototypes are created from the training subjects and the prototype selection is 
optimized over the validation set. 

The maximum admissible size of each work-group depends on the GPU itself and on the kernel 
specifically implemented for the problem. In our case, and for the Nvidia GeForce GTX 560 GPU, 
the limit is 768 work-items per work-group, which in practise is higher than the number of work- 
items we need: 5=60 if scene blocks of 2 seconds are matched in video streaming mode, and 
5=754 if whole scene videos of 30 seconds are matched. 

Floating point operations on Nvidia GPUs use different rounding algorithms than the FPUs on 
Intel CPUs, which may results in slightly different values obtained in the trellis. However, this did 
not impact the recognition results. 

We would like to point out that many results have been published on the KTH database, but 
most of the results cannot be compared due to different evaluation protocols, as has been studied on 
the detailed report on the KTH database in lfl5ll . For completeness, we compare our performance 
with state-of-the-art methods. In Table [T] we report average action recognition performance and 
computational time for the aforementioned methods in Section [T] The results have been copied 
from the original papers. Although run-time calculations and used protocols differ between the 
papers, this table gives an overall idea that our proposed method is extremely competitive in terms 
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Implementation Optimizations Time 

(ms) 


CPU: Intel Core 2 Duo, 

3.2 

3.3 

2900 

E8600 @ 3.33 Ghz, 




Matlab/C(mex) 




Basic GPU Kernel 

3.2 

3.3 

794 


Nvidia GeForce GTX580 


GPU Kernel with all alg. optimizations 
Nvidia GeForce GTX580 


3.2 

3.3 

+ 

3.4 


326 


GPU Kernel with all alg. and archit. optimizations 

3.2 

3.3 

3.4 

+ 

178 

Nvidia GeForce GTX580 

3.5 

3.6 

3.7 

3.8 


GPU Kernel with all alg. and archit. optimizations 

3.2 

3.3 

3.4 

+ 

1853 

Nvidia GeForce GTX580 

3.5 

3.6 

3.7 

3.8 



(No closeness constraint: T =oo) 


Table 3: Contribution of different optimizations on execution time. Matching one model graph of 
30 nodes to one scene graph of 754 nodes and 723 frames. 

Table 4: Comparison of different model types with and without restriction to a single point per 
frame (M: number of model nodes, S: number of scene nodes, M: number of model frames, S: 
number of scene frames, R: maximum number of interest points per frame in the scene sequence, 
N = 3: number of matched single-chain-single-point model). 


Method 

Complexity 

Accuracy 

Proposed method (1 point per frame) 

0(SMT 2 ) 

91.0% 

N=3 points per frame, method 1 (greedy) 

0(SMT 2 R n ) 

90.2% 

N=3 points per frame, method 2 (independent chains) 

o\smt 2 n) 

90.8% 


of run-time with at the same time providing recognition performance comparable to the state of 
art. The best performing methods require several seconds of calculation per frame, whereas our 
method requires only 0.0035 seconds per frame. Our GPU algorithm is faster in several orders of 
magnitude. 

The GPU implementation allows faster than real-time performance on standard medium end 
GPUs, e.g. a Nvidia GeForce GTS450. Table [2] compares run-times of the CPU implementation in 
Matlab/C (the critical sections were implemented in C) and the GPU implementation running on 
different GPUs with different characteristics, especially the number of calculation units. The run¬ 
times are given for matching a single model graph with 30 nodes against scene blocks of different 
lengths. If the scene video is cut into smaller blocks of 60 frames, which is necessary for continuous 
video processing, real time performance can be achieved even on the low end GPU model. With 
these smaller chunks of scene data, matching all 50 graph models to a block of 60 frames (roughly 
2 seconds of video) takes roughly 3 ms regardless of the GPU model. 

We can observe that the results for the three different GPUs (GTS450, GTX560 and GTX580) 
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are identical when models are matched against scenes of 60 nodes. In this case, the number of cuda 
cores is higher than the amount of requested work items per call, i.e. per frame. Therefore, adding 
additional cuda cores does not give us an increase in performance. Other architectural differences 
(frequency, bandwidth) between the video cards are negligible and beyond the precision of time 
measurement. If we match one model against a whole video (754 frames), which requires much 
more work items, then the extra amount of cuda cores of the GTX560 and GTX580 cards makes a 
large difference in performance. 

The processing time of 3 ms per frame is very much lower than the limit for real time pro¬ 
cessing, which is 40 ms for video acquired at 25 frames per second. Additional processing will 
be required in order to treat overlapping blocks, which increases running time to 6 ms per frame. 
The times given above also do not include interest point detection and feature extraction, but fast 
implementations of these steps do exist. As mentioned, in our experiments we used Laptev et al’s 
STIP points ll23ll . which performed slightly better than other detectors. Its pure CPU implemen¬ 
tation requires 82 ms per frame. However, our own implementation of Dollar et al.’s points iflOl 
runs in 2 ms per frame including the detection of HoG features (also on CPU). These run-times 
can be decreased porting the algorithms to GPU, especially Dollar et al.s detector, which essen¬ 
tially proceeds through linear filtering in time and space. The convolution operations can be easily 
performed on GPUs. 

In table [3] we compare execution times between the CPU implementation of the matching al¬ 
gorithm, and several implementations on GPU, with different types of optimizations. Comparing 
the CPU version running time (2900 ms) with the non-optimized GPU version (794 ms), we can 
see that the parallel hardware architecture of the GPU is almost 4 times faster than the CPU. The 
algorithmically optimized version of the GPU kernel (including optimizations described in sections 
3.2[ 3.3[ 3.4 1 takes 326 ms to run, and is then 2.4 times faster than the non-optimized version. Fi¬ 


nally, the GPU architectural optimizations described in sections |33j|3.6||3.7[|3.8| bring an additional 
speedup of factor 2 (178 ms vs. 326 ms). 

The last line in the same table (table [3]) shows a comparison with the fully optimized GPU 
version without temporal closeness constraint (1853 ms), giving the influence of the restriction 
of temporal warping. Figure [7] shows this effect of parameter T (restricting temporal waip) on 
performance. The parameter retained for all other experiments is T=10, and at this configuration 
we find the value of 178 ms also given in tables [2] and [3] Loosening the restrictions will not only 
significantly increase computational complexity, it will also decrease recognition performance. The 
restriction due to the parameter T allows to remove wrong assignments. No restrictions (T = +oo) 
will lead to a runtime of 1853 ms. 

We also performed experiments to check the influence of the restriction to a single interest point 
per frame. The original formulation given in equation ([!]) is of polynomial complexity, but still too 
complex to solved in real time. We compared the proposed approximation to two different models 
allowing several interest points per frame : 


Multiple points 1: a greedy approximation of the the full model, where interest points assignments 
are decoupled from frame assignments. First, for each model frame and for all possible 
assigned scene frames, the optimal interest point assignments are made based on unary terms 
U and inter-frame deformation terms D. Then, frames are assigned using the proposed 
model. 


Multiple points 2: creation of several single point models (several second order chains), each of 
which is solved independently. The resulting matching distance is given as the average over 
the different chains. 
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(a) (b) 

Figure 7: Performance as a function of parameter T: (a) run-time; (b) classification performance 
(accuracy). No restrictions (T = +oo) will lead to a runtime of 1853 ms (Nvidia GTX580, 754 
scene nodes). 


More details on these models arc given in [8] - Results on the KTH are given in table [4] Results 
of the multiple point methods are comparable, even slightly lower. After investigating the results 
in detail, we believe that taking into account more points hurts performance because this decreases 
the invariance properties of the method. Space-time points are naturally very sparse, and taking 
more points actually leads to including less stable and less robust points. 

5 Conclusion 

An efficient parallel algorithm for activity recognition has been presented, which resorts to parallel 
hypergraph matching on GPUs. The traditional problem of irregular computation flow in graph 
matching has been addressed by restricting the graph to a regular structure, which allows efficient 
inference by a recursive function working on a 3D trellis. The values in the trellis are computed 
in a block-wise parallel manner. The method is competitive with the state of the art in terms of 
recognition performance while at the same time being faster in several orders of magnitude. The 
current implementation is faster than real time on medium-end GPUs even though only a part of the 
computing units are currently used. Further speed-up could be gained by matching a scene video 
block against several model graphs in parallel and by distributing the cells of the multiple trellis 
over the computing units of the GPU. 
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